The multiplicity of possible hardware implementations for a given
computational scheme is efficiently displayed by a space-time
representation, a notational tool that has been incorporated into some
recent methodologies for modeling, analysis and design of parallel
architectures [1-9]. Coordinate transformations of a given space-time
representation  produce  distinct  hardware  configurations  which  are  equivalent 
in  the  sense  of  being  the  implementations  of  the  same  computational  scheme. 
The  problem  of  mapping  a  given  algorithm  into  a  desired  hardware 
configuration  can,  therefore,  be  partly  reduced  to  choosing  the  appropriate 
coordinate  transformation  in  space-time.  In  particular,  uniform  recurrence 
relations,  which  correspond  to  systolic-array  architectures,  are  described 
by  regular  space-time  representations.  This  implies  that  only  linear 
coordinate  transformations  are  required,  and  that  the  entire  computational 
scheme  can  be  described  by  a  small  collection  of  vectors  in  space-time,  the 
dependence vectors [3,5,6,8]. Consequently, the selection of a desired
hardware  architecture  for  a  given  algorithm  reduces  to  the  determination  of 
an  appropriate  nonsingular  matrix  with  integer  entries. 

Previous  research  has  focused  upon  the  algebra  of  such  transformation 
matrices  in  multidimensional  linear  spaces,  establishing  conditions  for  the 
mappability  of  given  algorithms  into  systolic-array  architectures.  However, 
since physical space is 3-dimensional, the dimensionality of space-time
cannot exceed 4. Moreover, since VLSI hardware must be planar and no
intersection of wires is allowed, only 3-dimensional space-time
representations are allowed. Consequently, the number of distinct
systolic-array architectures is very small; Miranker and Winkler have, in
fact, shown that only three topologies (linear, rectangular, hexagonal) are
permitted for systolic arrays.

This paper establishes a simple technique for transforming a given
3-dimensional space-time representation into an equivalent canonical form. A
catalogue of canonical forms is constructed, showing a total of 34 distinct
systolic architectures. The task of selecting an appropriate transformation
for a given space-time representation reduces, therefore, to the
determination of the equivalent canonical form. The important result, which
has been overlooked in previous research, is that the canonical equivalent
of any given space-time representation is unique. This means that once a
space-time representation has been specified there is no flexibility left in
the process of mapping into systolic-array architectures.

A small fraction of space-time representations do allow some flexibility
in selecting the hardware architecture, but only at the cost of inefficient
implementation. The well-known example of matrix multiplication, which has
four distinct realizations (see [11]-[14]), turns out to be one of the few
cases where such flexibility is available. A closer examination of the
structure of the matrices to be multiplied reveals that each realization is
efficient under a different set of structural assumptions (see Section 4.3).
Thus,  in  summary,  carefully  specified  algorithms  lead  to  unique  space-time 
representations  which,  in  turn,  lead  to  essentially  unique  architectures. 


SECTION 2


COMPLETELY REGULAR MCNs

A  modular  computing  network  (MCN)  can  be  loosely  defined  as  an 
association  of  multivariable  functions  with  the  vertices  of  a  directed 
acyclic  graph  (see  [7]  for  a  rigorous  definition).  More  precisely, 
multivariable  functions  are  associated  only  with  internal  vertices,  which 
are those vertices that have both incoming and outgoing arcs. A vertex with
$p_i$ incoming arcs and $p_o$ outgoing arcs is associated with an input-output
map with $p_i$ input variables and $p_o$ output variables.

A completely regular MCN is one that can be represented by a regular
multidimensional grid, and in which all input-output maps associated with
the vertices are identical. Thus, the vertices of a completely regular MCN
can be mapped into points of the multidimensional grid $Z^n$ in the
n-dimensional Euclidean space $R^n$, where Z denotes the set of integers;
the arcs of a completely regular MCN become vectors (n-tuples of real
numbers) representing the directed straight lines connecting points of the
grid $Z^n$. Clearly, not all points in $Z^n$ correspond to vertices of the
MCN. Those that do determine the domain $\Gamma$ of the MCN in $Z^n$. The
requirement of complete regularity translates into the statement that the
vectors (arcs) emanating from any point (vertex) in $\Gamma$ do not depend upon
the choice of vertex. Consequently, the entire MCN is characterized by:

(i) the set of dependence vectors $\{d_i\}$ emanating from a single
vertex;

(ii) the domain $\Gamma \subset Z^n$;

and

(iii) the input-output map

$$f : \{x_i\} \to \{y_i\}$$

associated with every vertex in the domain $\Gamma$.

A curious consequence of this definition is that the input-output map f
has the same number of inputs and outputs, since the number of arcs
emanating from a point in $\Gamma$ is always the same as the number of arcs
converging to a point.

Not every set of dependence vectors $\{d_i\}$ determines a valid MCN. For
instance, the directed graph representing an MCN has to be acyclic. In
terms of dependence vectors this means that it is impossible to find
positive integers $\{k_i\}$ such that $\sum_i k_i d_i = 0$. Another requirement is
that the ancestry of every vertex $v \in \Gamma$ (i.e., the set of all points from
which v can be reached) has to be finite. This constraint is trivial if
$\Gamma$ is a finite set; however, if $\Gamma$ is infinite (as is often the case with
signal processing algorithms) this constraint implies that $\Gamma$ has to be
bounded in the directions $\{-d_i\}$.

In the sequel we shall focus upon completely regular MCNs in $Z^3$,
because such MCNs correspond to space-time representations of planar
systolic-array-like architectures (see [3]-[7]). We shall impose the
constraint of causality resulting from the association of 'time' with one of
the coordinate axes in $Z^3$ and examine the flow of data through the
architecture in terms of the dependence vectors characterizing the MCN.


2.1 SPACE-TIME REPRESENTATIONS IN Z^3

MCNs in $Z^3$ are characterized by 3-dimensional dependence vectors
$\{d_i\}$, which we shall represent by row vectors of length 3. The collection
of all dependence vectors

$$D = [d_i]_{i=1}^{p} \qquad (2.1)$$

forms a $p \times n$ matrix, which we shall call the dependence matrix. The
boundary of the domain can always be described as a polyhedron. It will
be sufficient for our purposes to consider only convex polyhedra, and in
fact, only those that can be characterized in terms of the dependence
vectors (see Section 3.4 for a further discussion of this choice).

The interpretation of MCNs in $Z^3$ as space-time representations of
hardware architectures imposes the additional constraint of causality:
every dependence vector must have a positive time coordinate, since
computation and propagation of data cannot be accomplished in zero time.
Moreover, since data cannot propagate faster than the speed of
electromagnetic waves in metallic conductors, the directions of dependence
vectors must lie within a certain cone, the time-like cone in the space-time
continuum. By appropriate scaling of space and time coordinates we can
reduce this condition to the requirement

$$d_i \, [0 \;\; 0 \;\; 1]^T \ge 1 \qquad (2.2)$$

which means that the third coordinate of $d_i$ must be an integer greater
than or equal to 1.
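Condition (2.2) can be checked mechanically. The following is a minimal Python sketch; the function name `is_causal` and the two sample matrices are assumptions introduced for this illustration and are not part of the original text.

```python
def is_causal(D):
    """Return True if every dependence vector (row of D) satisfies (2.2).

    D is a list of rows [x, y, t] with integer entries; the third entry
    is the time coordinate, which must be an integer >= 1.
    """
    return all(isinstance(d[2], int) and d[2] >= 1 for d in D)


# Hypothetical examples: three unit-time vectors, and one matrix that
# contains a vector with zero time displacement.
D_ok = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
D_bad = [[1, 0, 1], [0, 1, 0]]

print(is_causal(D_ok), is_causal(D_bad))
```

The check only covers the scaled form (2.2); the unscaled time-like cone condition would need the physical scaling constants, which the text does not give.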

The association of time with the third coordinate of dependence vectors
allows us to express the finite ancestry condition in simple form. The
exclusion of ancestors that are infinitely remote from a given vertex in the
domain $\Gamma$ is equivalent to the requirement that $\Gamma$ be a subset of the
positive half space of $Z^3$, i.e., the half space corresponding to
non-negative time coordinates. Moreover, since hardware must always be finite,
the spatial extent of $\Gamma$ must be bounded. Thus, the only direction in
which $\Gamma$ may remain unbounded is that of positive time, corresponding to a
computation that continues indefinitely in time (e.g., a filtering of an
infinite time-series), but produces results (outputs) at regular intervals.

Vertices in $\Gamma$ that share the same spatial coordinates are considered
as representing the same hardware processor at different instances in time.
Regularity implies that such isospatial vertices are spread in time at
regular intervals. This interval, which is the same for all processors,
will be called the periodicity index of the architecture. The periodicity
index corresponding to a given dependence matrix D is the smallest
positive solution $\pi$ of the equation

$$\eta D = \pi \, [0 \;\; 0 \;\; 1] \qquad (2.3)$$


where $\eta$ is any row vector with integer (possibly negative) entries. To
prove this result we notice that $\eta D$ is an integer combination of
dependence vectors; moreover, if $v(x_1,y_1,t_1)$ and $v(x_2,y_2,t_2)$ are two
distinct vertices in $\Gamma$, then the vector connecting these vertices can
always be expressed in the form $\eta D$ for an appropriate (possibly nonunique)
row vector $\eta$. If the two vertices share the same spatial coordinates, then
their interconnecting vector is colinear with [0 0 1], and so (2.3) is
satisfied for some $\eta$, $\pi$. Finally, the smallest temporal displacement is
obtained when $\pi$ is minimized in (2.3). The periodicity index $\pi$ can,
actually, be evaluated without an exhaustive search through all possible
integer vectors $\eta$ that satisfy (2.3), as is demonstrated in Section
2.2.

The  most  important  attribute  of  the  space-time  representation  of  a 
completely  regular  MCN  is  the  invariance  of  the  MCN  under  coordinate 
transformations  in  space-time.  This  is  so  because  coordinate 
transformations  do  not  affect  the  interconnection  pattern  of  the  space-time 
representation,  and  consequently  leave  the  corresponding  directed  graph 
unaltered.  In  the  case  of  regular  space-time  representations  it  is 
sufficient  to  consider  the  effect  of  linear  coordinate  transformations;  this 
is done in detail in Sections 3 and 4.


2.2 SPATIAL PROJECTION OF MCNs IN Z^3


The  first  two  coordinates  in  a  three-dimensional  space-time  can  be 
interpreted  as  physical  space.  When  a  space-time  representation  is 
projected  into  the  plane  formed  by  the  first  two  coordinates,  vertices 
represent  computing  agents  (i.e.,  processors)  and  arcs  represent  physical 
interconnections  (i.e.,  wires).  The  projection  amounts  to  the  truncation  of 
each dependence vector to its first two coordinates, viz.,

$$D_s = D \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \qquad (2.4)$$
The truncated dependence matrix $D_s$ ('s' stands for 'spatial') is usually
sufficient to characterize the architecture, since we commonly assume that
each dependence vector represents a computation that requires a unit of
time, and consequently

$$D = [\, D_s \;\; \mathbf{1} \,], \qquad \mathbf{1} = [1 \;\; 1 \;\; \cdots \;\; 1]^T \qquad (2.5)$$
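The projection (2.4) and the unit-time reconstruction (2.5) amount to dropping and re-appending a column. A minimal Python sketch follows; the helper names `truncate` and `reconstruct_unit_time` are hypothetical, and the sample matrix is the three-vector example used later in the text for Figure 3-1.

```python
def truncate(D):
    """Spatial projection (2.4): keep the first two coordinates of each row."""
    return [list(d[:2]) for d in D]


def reconstruct_unit_time(D_s):
    """Unit-time reconstruction (2.5): append a time coordinate of 1.

    Valid only under the assumption that every dependence vector takes
    exactly one unit of time.
    """
    return [list(d) + [1] for d in D_s]


D = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
D_s = truncate(D)
print(D_s)
print(reconstruct_unit_time(D_s) == D)
```

The round trip succeeds here because all time coordinates equal 1; a matrix containing a memory vector of the form [0 0 τ] with τ > 1 would not be recovered by this reconstruction.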


This assumption is violated only when D has a periodicity index $\pi(D) > 1$
and, in addition, D contains a dependence vector of the form $[0 \;\; 0 \;\; \tau]$.
This dependence vector is truncated to [0 0], so $\tau$ cannot be recovered
unless $\tau = \pi$ or $\tau = 1$. These, in fact, are the only two possible values
for $\tau$, as explained in Section 4.4.

The  truncated  dependence  matrix  can  be  pictorially  represented  by  a 
conventional  block-diagram  such  as  Figure  2-1.  Each  truncated  dependence 
vector  is  represented  by  an  arc  with  the  appropriate  spatial  displacement, 
while truncated dependence vectors of the form [0 0], which correspond
to local memory, are represented by self-arcs.


Figure 2-1. Example of a Regular Hardware Architecture
(panel b: Dependence and Truncated Dependence Matrices)
The spatial part of (2.3) reads $\eta D_s = 0$,
so that every feasible choice of $\eta$ corresponds to an undirected loop in
the 2-dimensional block-diagram representation. Thus, every feasible $\eta$ is
obtained by considering all possible loops in the block-diagram
representation. If there are no self-loops on vertices, then $D_s$ contains
no zero row and (2.5) holds. Consequently, by (2.3),

$$\eta D \, [0 \;\; 0 \;\; 1]^T = \eta \, [1 \;\; 1 \;\; \cdots \;\; 1]^T = \pi$$

so $\pi$ is obtained by adding up the entries of $\eta$. This is, in fact, done
by counting each arc along the loop as 1 if it coincides with the
orientation of the loop and as -1 if it points in the reverse direction.
Since the smallest value of $\pi$ is required, only the shortest loops need to
be considered. We shall show in Section 3.3 that $\pi$ never exceeds 3 and is
seldom larger than 1.
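For small dependence matrices the periodicity index can also be found by a direct search over integer vectors in (2.3). The sketch below is one possible illustration in Python, not the loop-counting procedure described above; the function name `periodicity_index`, the search bound, and the sample matrices (chosen to be consistent with the loop vectors quoted in Section 3.3) are assumptions.

```python
from itertools import product


def periodicity_index(D, bound=2):
    """Smallest positive pi such that eta * D = pi * [0, 0, 1].

    Brute-force search over integer row vectors eta whose entries lie in
    [-bound, bound] (an assumed, illustrative search range). If no such
    eta exists, the index is defined to be 1, following the text.
    """
    best = None
    for eta in product(range(-bound, bound + 1), repeat=len(D)):
        # Components of the integer combination eta * D.
        x = sum(e * d[0] for e, d in zip(eta, D))
        y = sum(e * d[1] for e, d in zip(eta, D))
        t = sum(e * d[2] for e, d in zip(eta, D))
        if x == 0 and y == 0 and t > 0 and (best is None or t < best):
            best = t
    return 1 if best is None else best


# Assumed unit-time dependence matrices for three small examples.
hex_a = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]     # loop eta = [1, 1, -1]
hex_b = [[1, 0, 1], [0, 1, 1], [-1, -1, 1]]   # loop eta = [1, 1, 1]
two_way_line = [[1, 0, 1], [-1, 0, 1]]        # loop eta = [1, 1]

print(periodicity_index(hex_a),
      periodicity_index(hex_b),
      periodicity_index(two_way_line))
```

The exhaustive search grows exponentially with the number of dependence vectors, which is why the loop-counting argument in the text is preferable beyond toy cases.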


SECTION 3


CLASSIFICATION OF HARDWARE ARCHITECTURES


Completely regular MCNs were characterized in the previous section in
terms of their dependence vectors. It was also indicated that MCNs with
different dependence vectors may nevertheless be equivalent, namely they
will have equivalent space-time representations. The equivalence of
completely regular MCNs is easy to verify, since it amounts to the existence
of a nonsingular linear transformation relating the dependence matrices of
the MCNs in consideration.

The study of equivalence can be carried out at several different levels
of abstraction. At the lowest (most detailed) level each completely regular
MCN is represented by a dependence matrix


$$D = [d_i]_{i=1}^{p} \qquad (3.1)$$

where $d_i$ are row vectors of length 3 whose first two coordinates represent
the planar space of integrated circuits and the third coordinate represents
time. Thus, for instance, the MCN of Figure 3-1 is characterized by the
dependence matrix


$$D = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

Notice that the time coordinate of all three dependence vectors equals 1,
reflecting  the  assumption  that  each  dependence  vector  represents  a 
computation  that  requires  a  unit  of  time.  This  assumption  can,  of  course, 
be  modified  to  incorporate  computations  with  unequal  processing  times. 

Notice  also  that  the  direction  of  dependence  vectors  coincides  with  that  of 
the  arrows  in  Figure  3-1,  pointing  toward  the  successors  of  a  given 
processor,  rather  than  toward  the  predecessors  of  the  same  processor,  as  in 
[3].
Figure 3-1. Example of a Completely Regular MCN


At the intermediate level of abstraction only the spatial coordinates
of each dependence vector are considered. This eliminates the third column
of the dependence matrix D and yields the truncated dependence matrix

$$D_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$$

for the example of Figure 3-1. We shall show in the following section that
the truncated dependence matrix $D_s$ provides, in fact, a complete, albeit
implicit, characterization of the MCN. This characterization can be
transformed in a unique manner into the explicit characterization D.

At  the  highest  level  of  abstraction  only  the  topology  of  the  hardware 
is  considered.  This  means  that  the  directed  graph  representing  the  flow  of 
data is replaced by the corresponding undirected graph. Thus, for
instance, the MCN of Figure 3-1 and that of Figure 3-2 are topologically
equivalent, even though the latter has a different (truncated) dependence
matrix, viz.

$$D_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{bmatrix}$$

Figure 3-2. A Completely Regular MCN Which is Topologically Equivalent
to that of Figure 3-1


This  section  is  devoted  to  the  study  of  topological  equivalence 
followed by the study of architectural ($D_s$) equivalence. The more
complicated  topic  of  space-time  equivalence  is  presented  in  the  following 
section,  where  it  is  also  shown  that  distinct  hardware  configurations  may, 
nevertheless,  have  equivalent  space-time  representations. 


3.1  TOPOLOGICAL  EQUIVALENCE 

The topic of topological equivalence has been studied by Miranker and
Winkler [3], who have shown that there are only three distinct topologies
(Figure 3-3):

(1) The linear topology, with a single dependence vector,

$$D_s = [1 \;\; 0]$$

(2) The rectangular topology, with two dependence vectors,

$$D_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

(3) The hexagonal topology, with three dependence vectors,

$$D_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$$

Figure 3-3. The Three Fundamental Topologies
(a. The Linear Topology; b. The Rectangular Topology; c. The Hexagonal Topology)

Every systolic-array-like architecture can be related by a linear
transformation to one of these fundamental topologies. Also, it is
impossible to have more than three non-colinear dependence vectors in a
planar architecture.

The same conclusion can be reached by a graph-theoretic argument. The
graph describing the hardware configuration of a completely regular MCN is
clearly a mosaic, i.e., a planar graph in which all faces are bounded by the
same number of edges and all vertices (except those on the external boundary
of the graph) have the same number of incident edges. As is well known,
there are only three possible mosaics [15]: triangular, rectangular and
hexagonal. The triangular mosaic has vertices of degree 6 and coincides
with the hexagonal topology. The rectangular mosaic has vertices of degree
4 and coincides with the rectangular topology. The hexagonal mosaic (Figure
3-4) does not correspond to a completely regular MCN, since it requires two
sets of dependence vectors rather than one. However, it can be rearranged
by combining pairs of adjacent processors into a single processor (Figure
3-4b), so that the resulting configuration has a rectangular topology. Thus,
there are only two mosaics corresponding to completely regular MCNs, which
combined with the linear configuration makes a total of 3 fundamental
topologies.


3.2 ARCHITECTURAL EQUIVALENCE


Each of the interconnecting wires in the three fundamental topologies
can be associated with two direction vectors, one pointing along the wire in
one way, the other in the reverse. This makes a total of three
possibilities for each interconnecting wire: (i) $+d$, (ii) $-d$, and
(iii) $\pm d$. This means that the linear topology results in $3^1 = 3$
architectures, the rectangular topology in $3^2 = 9$ architectures and the
hexagonal topology in $3^3 = 27$ architectures. Since many of these

architectures  are  equivalent,  a  classification  of  the  distinct  architectures 
is  provided  in  Table  3-1.  The  nomenclature  consists  of  a  capital  letter 
(L,  R  or  H)  indicating  the  topology  (linear,  rectangular  or  hexagonal),  a 
digit  indicating  the  number  of  dependence  vectors  and  a  lower  case  letter, 
whenever  required,  to  distinguish  between  architectures  which  have  the  same 
topology  and  the  same  number  of  dependence  vectors  but  are  not  equivalent, 
e.g.,  H3a  and  H3b.  The  table  lists  all  equivalent  configurations  in  a 
single  row. 
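The counting argument above can be reproduced by enumerating the three choices per wire. The following Python sketch is illustrative only; the function name `count_by_vectors` and the labels "+", "-", "+-" are assumptions, and the grouping is by number of dependence vectors (the digit in the nomenclature), not by the equivalence classes of Table 3-1.

```python
from collections import Counter
from itertools import product


def count_by_vectors(num_wires):
    """Count direction assignments grouped by number of dependence vectors.

    Each wire carries +d, -d, or both (+-d). A one-way wire contributes
    one dependence vector, a two-way wire contributes two.
    """
    counts = Counter()
    for choice in product(("+", "-", "+-"), repeat=num_wires):
        vectors = sum(2 if c == "+-" else 1 for c in choice)
        counts[vectors] += 1
    return dict(counts)


for name, wires in (("linear", 1), ("rectangular", 2), ("hexagonal", 3)):
    print(name, count_by_vectors(wires))
```

For the hexagonal topology the groups have sizes 8, 12, 6 and 1 for 3, 4, 5 and 6 dependence vectors, which add up to the 27 architectures mentioned in the text.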


3.3 PERIODICITY ANALYSIS AND THROUGHPUT

The  occurrence  of  cycles  (i.e.,  closed  loops  of  directed  arcs)  in  the 
directed  graph  representing  a  hardware  architecture  provides  important 
information  about  the  throughput  rate  of  the  architecture.  In  this 
subsection  we  analyze  this  information  and  identify  the  configurations  with 
low  throughput. 

The periodicity index $\pi$ of architectures has been defined in Section
2.2. It can be computed either by examining undirected loops in the graph
representing the architecture or by solving the equation

$$\eta D_s = 0 \qquad (3.2)$$

for every possible row vector $\eta$ with integer elements, and summing the
elements of $\eta$. The periodicity index $\pi$ equals the smallest of these
sums. If no solution $\eta$ exists, $\pi$ is defined to be 1. Following this
technique we conclude that L1, R2 have no solution and have a unit
periodicity index, while other architectures have solutions, as follows:

(i) L2 has $\eta$ = [1 1]; hence $\pi$ = 2.

(ii) R3 has $\eta$ = [1 1 0]; hence $\pi$ = 2.

(iii) R4 has $\eta$ = [1 1 0 0], [0 0 1 1]; hence $\pi$ = 2.

(iv) H3a has $\eta$ = [1 1 -1]; hence $\pi$ = 1.

(v) H3b has $\eta$ = [1 1 1]; hence $\pi$ = 3.

(vi) H4a has $\eta$ = [1 1 -1 0], [0 0 1 1], [1 1 0 1]; hence $\pi$ = 1.

(vii) H4b has $\eta$ = [1 -1 0 1], [0 0 1 1]; hence $\pi$ = 1.

(viii) H5 has $\eta$ = [1 0 1 0 -1], [1 1 0 0 0], [0 0 1 1 0]; hence $\pi$ = 1.

(ix) H6 has $\eta$ = [1 0 1 0 -1 0], [1 1 0 0 0 0], [0 0 1 1 0 0],
[0 0 0 0 1 1]; hence $\pi$ = 1.

In the sequel we shall measure the throughputs of architectures
relative to the throughput of the linear architecture L1 (a classical
pipeline). Since the time interval between two successive applications of
input equals the periodicity index, the relative throughput of a given
architecture is given by the formula

$$\text{relative throughput} = \frac{1}{\text{periodicity index}} \qquad (3.3)$$

Thus, the relative throughput of L2, R3, R4 is 1/2 and that of H3b is
1/3.
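Formula (3.3) can be tabulated directly from the periodicity indices listed above. This is a small Python sketch; the dictionary `PERIODICITY` simply transcribes the values quoted in this subsection, and the function name `relative_throughput` is an assumption.

```python
from fractions import Fraction

# Periodicity indices as quoted in Section 3.3 for each architecture.
PERIODICITY = {
    "L1": 1, "R2": 1, "L2": 2, "R3": 2, "R4": 2,
    "H3a": 1, "H3b": 3, "H4a": 1, "H4b": 1, "H5": 1, "H6": 1,
}


def relative_throughput(pi):
    """Relative throughput (3.3): the reciprocal of the periodicity index."""
    if pi < 1:
        raise ValueError("periodicity index must be a positive integer")
    return Fraction(1, pi)


table = {name: relative_throughput(pi) for name, pi in PERIODICITY.items()}
for name, value in table.items():
    print(name, value)
```

Exact fractions are used so that values such as 1/3 are not rounded.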
3.4 BOUNDARY ANALYSIS


No assumption has been made up to this point about the shape of the
boundary of a given hardware architecture. However, since the shape of the
boundary is changed by linear transformation it has to be taken into
consideration in the process of classifying architectures. As an example
consider the 6 equivalent configurations denoted by H3a (Table 3-1). The
truncated dependence matrices of the first and third of these configurations
are related by a linear transformation, viz.


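$$D_s^{(3)} = D_s^{(1)} \, T, \qquad T = \begin{bmatrix} 1 & 1 \\ -1 & 0 \end{bmatrix}$$

where $D_s^{(1)}$ and $D_s^{(3)}$ denote the truncated dependence matrices of the
first and third configurations, respectively.
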

Now assume that the first configuration has a rectangular boundary, which
can be characterized by the boundary matrix

$$B_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

consisting of all dependence vectors colinear with the boundary. The linear
transformation maps this boundary into

$$B_s' = B_s \begin{bmatrix} 1 & 1 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ -1 & 0 \end{bmatrix}$$


which characterizes a parallelogram rather than a rectangle. Thus, the
first H3a configuration with a rectangular boundary is equivalent to the
third H3a configuration with a parallelogram boundary. It is not
equivalent, however, to the third H3a configuration with a rectangular
boundary. Clearly, we need to reclassify the entries of Table 3-1 according
to both the dependence matrix and the boundary.
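The effect of a linear transformation on the boundary can be checked numerically. The Python sketch below assumes the transformation matrix with rows [1 1] and [-1 0] recovered from the boundary equation above; the helper names `matmul` and `is_rectangle` are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def is_rectangle(B_s):
    """True if the two boundary directions (rows of B_s) are orthogonal."""
    (a, b), (c, d) = B_s
    return a * c + b * d == 0


T = [[1, 1], [-1, 0]]        # assumed transformation matrix
B_s = [[1, 0], [0, 1]]       # rectangular boundary of the first configuration
B_new = matmul(B_s, T)       # boundary after the transformation

print(B_new, is_rectangle(B_s), is_rectangle(B_new))
```

The transformed boundary directions are no longer orthogonal, which is the sense in which the rectangle becomes a parallelogram.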

We shall be concerned only with boundaries that satisfy the two
following conditions:

(i) The boundary curve is a closed convex polygon.

(ii) Each segment of the boundary curve is colinear with some
dependence vector.

Thus, the only possible directions for the segments of the boundary curve
are [1 0], [0 1] and [1 1]. Consequently, there are four possible
boundary curves (Figure 3-5): rectangle, parallelogram, lower triangle,
upper triangle. Of these, only the rectangle-shaped boundary can be applied
to the linear (L) and rectangular (R) architectures. On the other hand, all
four possible boundaries can be combined with hexagonal (H) architectures.
However, since linear transformations map rectangles into parallelograms and
lower triangles into upper ones, we need only consider the combination of
each hexagonal entry of Table 3-1 with either a rectangular or a triangular
boundary.


Figure 3-5. Fundamental Boundary Curves
(a. Rectangle; b. Parallelogram)


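With rectangular boundaries we need to consider matrices of the form

$$\begin{bmatrix} D_s \\ B_s \end{bmatrix}$$

Clearly

$$\begin{bmatrix} -D_s \\ B_s \end{bmatrix} \sim \begin{bmatrix} -D_s \\ -B_s \end{bmatrix} = \begin{bmatrix} D_s \\ B_s \end{bmatrix} (-I)$$
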

which  shows  that  the  reversal  of  all  dependence  vectors  does  not  produce  a 
new  configuration.  The  6  entries  in  each  one  of  the  rows  H3a,  H4a,  H4b,  H5 
of  Table  3-1  can,  therefore,  be  considered  as  3  pairs  of  conjugate 
configurations.  Of  these,  the  second  and  third  pair  are  still  equivalent 
when  combined  with  rectangular  boundaries,  but  the  first  pair  is  different. 
Thus,  the  entries  of  Table  3-1,  when  combined  with  rectangular  boundaries, 
can  be  reclassified  as  in  Table  3-2.  This  time  each  architecture  is 
specified by its matrix rather than by a pictorial description as in
Table 3-1.

Similarly, we can combine each hexagonal entry of Table 3-1 with a
lower triangular boundary. This will again produce two distinct
architectures for each one of the rows H3a, H4a, H4b, H5. However, there
is  no  need  to  do  it  explicitly,  since  the  resulting  configurations  can 
always  be  obtained  by  'cutting'  the  appropriate  hexagonal  topology  combined 
with  a  rectangular  boundary  along  the  main  diagonal.  Thus,  it  will  be 
sufficient  to  focus  in  the  sequel  upon  rectangular  boundaries  alone. 


3.5  SUMMARY 

Systolic-array-like  architectures  have  been  classified  by  topology, 
interconnection  pattern  and  shape  of  boundary.  We  have  shown  that  there  are 
only 15 distinct (non-equivalent) architectures (see Table 3-2). We have
also shown that it is sufficient to consider only rectangular boundaries,
which are of practical importance in the process of VLSI layout.
SECTION 4


CLASSIFICATION OF SPACE-TIME REPRESENTATIONS


The space-time representation of a completely regular MCN was
characterized in the previous section by the dependence matrix D. The
hardware configuration was obtained by focusing upon the spatial coordinates
of the dependence vectors, which resulted in the truncated dependence matrix
$D_s$. It was observed that the temporal coordinate of all the architectures
described in Section 3 was always equal to 1, viz.,

$$D = [\, D_s \;\; \mathbf{1} \,], \qquad \mathbf{1} = [1 \;\; 1 \;\; \cdots \;\; 1]^T \qquad (4.1)$$

so that the dependence matrix D can be easily reconstructed for any given
$D_s$ via (4.1). The properties of the corresponding space-time diagram can
then be deduced by analysis of the dependence matrix D.


4.1  THE  FUNDAMENTAL  SPACE-TIME  CONFIGURATIONS 

Each  of  the  fundamental  11  architectures  of  Table  3-2  determines  a 
fundamental  space-time  configuration.  We  shall  focus  our  attention  upon  the 
dependence  matrix  alone,  without  considering,  for  the  present,  the  shape  of 
the  boundary  surface.  Thus,  equivalence  between  the  fundamental  space-time 
configurations is established by relating the corresponding dependence
matrices by linear transformations. A simple analysis (see Appendix A)
shows that every dependence matrix with 2 vectors can be transformed into
the equivalent (canonical) form

$$D \sim \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$

and every dependence matrix with 3 vectors can be transformed into the
equivalent (canonical) form

$$D \sim \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Consequently, L2 ~ R2 and R3 ~ H3a ~ H3b, where the tilde (~) denotes
equivalence. For dependence matrices with more than 3 vectors it is
convenient to establish first a (nonunique) canonical equivalent, i.e., an
equivalent dependence matrix whose first three rows are the identity matrix,
viz.,

$$D \sim \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ & X & \end{bmatrix}$$


Some  canonical-form  equivalents  are  listed  in  Table  4-1.  The  full  list  of 
canonical equivalents will be discussed in later sections in conjunction
with  the  specification  of  boundary  surfaces  in  the  three-dimensional  space- 
time  continuum. 
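One way to obtain a canonical equivalent of the kind described above is to multiply the dependence matrix on the right by the inverse of its first three rows. The Python sketch below illustrates this under the assumption that those three rows are linearly independent; the function names and the four-vector sample matrix are hypothetical and are not taken from Table 4-1.

```python
from fractions import Fraction


def inverse3(M):
    """Exact inverse of a 3x3 integer matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("first three dependence vectors are dependent")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[Fraction(x, det) for x in row] for row in adj]


def canonical_form(D):
    """Return D * inv(first three rows): identity on top, X below."""
    T = inverse3(D[:3])
    return [[sum(x * t for x, t in zip(row, col)) for col in zip(*T)]
            for row in D]


# Hypothetical unit-time dependence matrix with four vectors.
D = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [-1, 0, 1]]
for row in canonical_form(D):
    print([int(x) if x.denominator == 1 else x for x in row])
```

In this example the transformation has determinant -1, so its inverse is an integer matrix; in general the entries of X may be rational, and whether the resulting transformation is admissible depends on the integrality requirement stated in the introduction.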


4.2  ARCHITECTURES  WITH  LOCAL  MEMORY 

The  preceding  analysis  was  based  upon  the  assumption  that  processors 
transmit  the  results  of  computations  to  their  immediate  neighbors  and  never 
store  them  for  further  use.  However,  many  applications  do  involve  such 
storage;  this  is  true,  in  particular,  for  adaptive  system/parameter 
identification algorithms that store the identified parameters in fixed
locations within the array and use the signals that flow through each
processor to time-update the locally stored parameters. In this section we
consider the architectures obtained by providing each processor with a local
memory.
Topologically, local memory means the addition of a self-loop to each
processor. The rest of the classification proceeds as explained in
Section 3.2, resulting in 11 new architectures (Table 4-2). Two important
observations have to be made regarding this table:

(i)  The number of dependence vectors is larger by one than the number
     of interconnections. Thus, for instance, RM3 has 4 direction
     vectors, not 3.

(ii) The length of the last dependence vector, corresponding to the
     local memory, equals the temporal displacement between two
     consecutive occurrences of the same processor in the space-time
     configuration. Thus, in general, this displacement is 1, except
     for L2, R3, R4 whose temporal displacement is 2 (corresponding to
     a periodicity index of 2), and except for H3b whose temporal
     displacement is 3 (corresponding to a periodicity index of 3).

Local  memory  can  also  be  used  to  interleave  computations  and  achieve 
increased  throughput  with  architectures  whose  relative  throughput  without 
memory  is  less  than  1.  This  possibility  will  be  discussed  in  Section  4.4. 

Analysis  of  equivalence  between  space-time  configurations  with  local 
memory  reveals  that: 

(i)   LM1, which has 2 linearly independent dependence vectors, is
      equivalent to L2, R2.

(ii)  RM2, which has 3 linearly independent dependence vectors, is
      equivalent to R3, H3a, H3b.

(iii) HM3a, which has 4 dependence vectors, is equivalent to R4.

In all three cases we can trade interconnecting links for memory, thereby
reducing the number of physical wires required to construct a realization of
the architecture and simplifying the layout problem for VLSI implementation.
Thus, for instance, the R2 architecture, which requires a planar network of
processors with 4 interconnecting ports at each processor, can be replaced by
LM1, which requires a linear network of processors with 2 interconnecting
ports at each processor and a local memory. Even more remarkably, the same
replacement also trades low-throughput configurations for high-throughput
ones.

[Figure: The Three Fundamental Topologies with Local Memory. Panels: The
Linear Topology with Memory (LM); The Rectangular Topology with Memory (RM);
(HM).]

TABLE 4-2. THE FUNDAMENTAL HARDWARE ARCHITECTURES WITH LOCAL MEMORY

[Table entries not recoverable; surviving column headings: LM1, LM2, RM2,
RM3, RM4.]
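
The counts of linearly independent dependence vectors quoted in this section
can be checked numerically. The following is a minimal Python sketch, not
part of the original analysis. It assumes that rows are dependence vectors
with time as the last coordinate, uses the dependence matrices of LM1, L2 and
R2 as they appear in Section 4.3.1, and takes RM2 to be R2 plus a local-memory
vector [0 0 1] (an assumption). The helper name `rank` is illustrative.

```python
from fractions import Fraction


def rank(rows):
    """Rank of an integer matrix by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(m[0])):
        # Find a row at or below position rk with a nonzero entry in this column.
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            f = m[i][col] / m[rk][col]
            m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


# Assumed dependence matrices (rows = vectors, last coordinate = time).
DEPENDENCE = {
    "LM1": [[1, 0, 1], [0, 0, 1]],
    "L2": [[1, 0, 1], [-1, 0, 1]],
    "R2": [[1, 0, 1], [0, 1, 1]],
    "RM2": [[1, 0, 1], [0, 1, 1], [0, 0, 1]],  # assumption: R2 plus memory vector
}

ranks = {name: rank(d) for name, d in DEPENDENCE.items()}
print(ranks)
```

Equal rank is only a necessary condition for equivalence; the explicit
transformations are discussed in Section 4.3.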


4.3  BOUNDARY  ANALYSIS 

The relation between boundary shapes and equivalence between (planar)
architectures has been examined in Section 3.4. The combination of topology
and boundary has produced 15 distinct architectures which were summarized in
Table 3-2. Since each one of these architectures has a rectangular
boundary, the resulting space-time configuration always occupies a
rectangular prism (with the exception of low-dimensional architectures such
as L1, L2, R2, whose space-time configurations occupy 1- or 2-dimensional
subspaces).

Since linear transformations change the shape of the boundary, the
equivalence between space-time configurations, discussed in Sections
4.1-4.2, has to be reexamined to include the effects of boundary
transformations.  It  will  be  sufficient  to  carry  out  this  analysis  only  for 
collections  of  space-time  configurations  which  have  been  declared  as 
equivalent  in  the  preceding  sections. 


4.3.1 The Configurations LM1, L2, R2

The  configurations  LM1,  R2  can  be  considered  equivalent  only  when  we 
assume  that  a  single  set  of  inputs  is  applied  to  R2  (rather  than  a  time- 
series).  In  this  case  R2  is  characterized  by 


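$$\begin{bmatrix} D \\ B \end{bmatrix} = \left[\begin{array}{ccc} 1 & 0 & 1 \\ 0 & 1 & 1 \\ \hline 1 & 0 & 1 \\ 0 & 1 & 1 \end{array}\right]$$

while LM1 is characterized by

$$\begin{bmatrix} D \\ B \end{bmatrix} = \left[\begin{array}{ccc} 1 & 0 & 1 \\ 0 & 0 & 1 \\ \hline 1 & 0 & 1 \\ 0 & 0 & 1 \end{array}\right]$$

and the two are equivalent, being related by a linear transformation, viz.

$$\begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}$$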
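
The relation between LM1 and R2 can be confirmed by direct multiplication.
The sketch below is illustrative only. It assumes that the dependence and
boundary parts of LM1 are both [1 0 1; 0 0 1], that those of R2 are both
[1 0 1; 0 1 1], and that the transformation acts on the right of the row
vectors. The names `matmul` and `det3` are hypothetical helpers.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def det3(t):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (t[0][0] * (t[1][1] * t[2][2] - t[1][2] * t[2][1])
            - t[0][1] * (t[1][0] * t[2][2] - t[1][2] * t[2][0])
            + t[0][2] * (t[1][0] * t[2][1] - t[1][1] * t[2][0]))


# Assumed characterizations: D-part and B-part coincide for both configurations.
LM1 = [[1, 0, 1], [0, 0, 1]]
R2 = [[1, 0, 1], [0, 1, 1]]

# Candidate transformation relating LM1 to R2.
T = [[1, -1, 0],
     [0, 1, 0],
     [0, 1, 1]]

image = matmul(LM1, T)
print(image, det3(T))
```

Because the dependence and boundary parts are identical here, one product
covers both, and a nonzero determinant confirms that the transformation is
nonsingular.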


On  the  other  hand,  LM1  and  R2  are  not  equivalent  to  L2  for  which 


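$$\begin{bmatrix} D \\ B \end{bmatrix} = \left[\begin{array}{ccc} 1 & 0 & 1 \\ -1 & 0 & 1 \\ \hline 1 & 0 & 1 \\ 0 & 0 & 1 \end{array}\right]$$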


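The D-part of this characterization can be related to the D-part of LM1,

$$\begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 & 0 & 0 \\ * & * & * \\ -1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$$

where the asterisks denote entries which can be chosen arbitrarily (subject
to the nonsingularity constraint of the linear transformation). However,
when the dependence matrix is combined with the boundary matrix we obtain

$$\left[\begin{array}{ccc} 1 & 0 & 1 \\ 0 & 0 & 1 \\ \hline 1 & 0 & 1 \\ 0 & 0 & 1 \end{array}\right] \begin{bmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix} = \left[\begin{array}{ccc} 1 & 0 & 1 \\ -1 & 0 & 1 \\ \hline 1 & 0 & 1 \\ -1 & 0 & 1 \end{array}\right]$$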

which does not match the B-part of L2. When the inverse of this
transformation is applied to the dependence and boundary matrices of L2,
the resulting configuration (Figure 4-2) is equivalent to an LM1
configuration of infinite order, in which the finite active segment
is shifted one cell to the right every time a new input is applied. Thus,
in summary, LM1 and R2 are equivalent to each other but not to L2.
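
The mismatch between LM1 and L2 can be reproduced numerically. The sketch
below is a hedged illustration. It assumes D = B = [1 0 1; 0 0 1] for LM1,
D = [1 0 1; -1 0 1] and B = [1 0 1; 0 0 1] for L2, and picks [0 1 0] for the
arbitrary middle row of the transformation. The helpers `matmul` and
`inverse` are hypothetical names.

```python
from fractions import Fraction


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Exact inverse by Gauss-Jordan elimination (fails on singular input)."""
    n = len(m)
    a = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for c in range(n):
        p = next(r for r in range(c, n) if a[r][c] != 0)
        a[c], a[p] = a[p], a[c]
        a[c] = [x / a[c][c] for x in a[c]]
        for r in range(n):
            if r != c and a[r][c] != 0:
                a[r] = [x - a[r][c] * y for x, y in zip(a[r], a[c])]
    return [row[n:] for row in a]


LM1_D = LM1_B = [[1, 0, 1], [0, 0, 1]]  # assumed LM1 characterization
L2_D = [[1, 0, 1], [-1, 0, 1]]          # assumed L2 dependence part
L2_B = [[1, 0, 1], [0, 0, 1]]           # assumed L2 boundary part
T = [[2, 0, 0], [0, 1, 0], [-1, 0, 1]]  # middle row is one admissible choice

d_image = matmul(LM1_D, T)              # D-part of LM1 carried by T
b_image = matmul(LM1_B, T)              # B-part of LM1 carried by T
l2_d_back = matmul(L2_D, inverse(T))    # D-part of L2 carried back
l2_b_back = matmul(L2_B, inverse(T))    # B-part of L2 carried back

print(d_image == L2_D, b_image == L2_B)
print(l2_d_back, l2_b_back)
```

Under these assumptions the D-parts agree in both directions, while the
second boundary vector of L2 maps to [1/2 0 1]. This tilted boundary looks
consistent with an active segment that drifts by one cell every two time
units, that is, once per input for a throughput of 1/2.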


4.3.2 The Configurations RM2, R3, H3a, H3b

The truncated boundary matrix of these configurations was chosen in
Section 3.4 as


namely, the rectangular boundary. The corresponding boundary surface in the
space-time continuum is, therefore, characterized by the boundary matrix

$$B = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} \tag{4.1}$$

When this boundary matrix is combined with the dependence matrices of RM2,
R3, H3aα, H3aβ, H3b, equivalence is destroyed. For instance, trying to
relate H3aα to RM2 we obtain


‘  1  0  1  ‘ 
Oil 
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“l  0  ~1 
Oil 
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0  -1 
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■  1  0  1  ■ 
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111. 


The resulting D-part coincides with the dependence matrix of RM2, but the
boundary surface is different. The configuration obtained above by
transforming H3aα is in fact an RM2 hardware of infinite order in which
a finite active segment shifts along the diagonal by one cell each time a
set of inputs is applied to the array. This is, in fact, precisely what
happens in systolic arrays for matrix multiplication. The configuration
H3aα (of Weiser and Davis [12]) is suited for multiplying banded matrices.
When the same problem is implemented on an RM2 configuration (of S.Y. Kung
[13]) most cells in the array are idle while a small active rectangle,
corresponding to the bandwidth of the given matrices, shifts along the main
diagonal of the array. In analogy, while multiplying two matrices without
structure is carried out efficiently by an RM2 array, solving the same
problem on an H3a configuration involves many idle cells and a small
active segment that shifts along the main diagonal.


4.3.3 The Configurations HM3a, R4

These configurations have the same boundary matrix, given by (4.1), as
RM2, R3, H3a and H3b. Since their dependence matrices are different, we
conclude that HM3aα, HM3aβ, R4 are distinct configurations when the shape
of the boundary surface is taken into account.


4.3.4  Summary 

When  boundary  considerations  are  taken  into  account  each  of  the  15 
architectures  of  Table  3-2  is  distinct  and  cannot  be  related  by  equivalence 
to  any  other  architecture  in  this  table.  Incorporating  local  memory  results 
in  doubling  the  total  number  of  distinct  configurations  to  30. 


4.4  INTERLEAVING  ARCHITECTURES  BY  LOCAL  MEMORY 

The introduction of local memory in Section 4.2 involved the assumption
that locally stored data remain in memory until required, which makes
particular sense in data-driven realizations. Consequently, the duration of
storage for some architectures (L2, R3, R4, H3b) was longer than one time
unit.  This  fact  can  be  used  to  construct  new  architectures  with  higher 
throughput,  by  interleaving  computations  in  time  and  connecting  the 
interleaved  computations  via  the  local  memory. 

The simplest example of such construction is the architecture L2.
Without interleaving, the throughput of L2, LM2 is 1/2 (Figure 4-3a).
With interleaving, which involves superimposing in time two L2 schemes and
interconnecting them via local memory, the resulting LiM2 configuration
(Figure 4-3b) has throughput 1. A similar approach produces the
architectures RiM3, RiM4 and HiM3b, whose dependence matrices are given in
Table 4-3. The difference between these architectures and their
noninterleaved counterparts is the shortening of the local memory dependence
vector from either [0 0 2] or [0 0 3] to [0 0 1].

TABLE 4-3. DEPENDENCE MATRICES FOR INTERLEAVED ARCHITECTURES

     LiM2         RiM3         RiM4         HiM3b
   ---------    ---------    ---------    ---------
    1  0  1      1  0  1      1  0  1      1  0  1
   -1  0  1     -1  0  1     -1  0  1      0  1  1
    0  0  1      0  1  1      0  1  1     -1 -1  1
                 0  0  1      0 -1  1      0  0  1
                              0  0  1
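
The passage from a noninterleaved architecture to its interleaved
counterpart amounts to shortening the local-memory vector. The sketch below
illustrates this under stated assumptions. The matrices for LM2 and HM3b are
inferred from Table 4-3 by restoring the memory vectors [0 0 2] and [0 0 3]
mentioned in the text, and the function name `interleave` is hypothetical.

```python
def interleave(dependence):
    """Shorten the local-memory vector [0 ... 0 k] to [0 ... 0 1].

    Returns the interleaved dependence matrix together with k, the number
    of computations superimposed in time.
    """
    *links, memory = dependence
    k = memory[-1]
    if any(memory[:-1]) or k < 1:
        raise ValueError("last row must be a local-memory vector [0 ... 0 k]")
    return [row[:] for row in links] + [memory[:-1] + [1]], k


# Assumed noninterleaved matrices (memory vector is the last row).
LM2 = [[1, 0, 1], [-1, 0, 1], [0, 0, 2]]
HM3b = [[1, 0, 1], [0, 1, 1], [-1, -1, 1], [0, 0, 3]]

LiM2, k_l = interleave(LM2)
HiM3b, k_h = interleave(HM3b)
print(LiM2, k_l)
print(HiM3b, k_h)
```

For LM2 the value k = 2 corresponds to the throughput of 1/2 quoted for the
noninterleaved scheme, which interleaving raises to 1.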

4.5  SUMMARY 

Space-time configurations have been classified by topology,
interconnection  pattern,  shape  of  boundary,  existence  of  local  memory  and 
interleaving.  The  15  fundamental  architectures  of  Table  3-2  give  rise  to 
another  15  configurations  involving  local  memory.  These,  in  turn,  give  rise 
to  4  interleaved  configurations,  producing  a  total  of  34  distinct  space-time 
configurations. 


Ignoring the shape of the boundary surface results in 20 distinct
configurations, viz.,

    1)  L1                                11)  H4aα, H4aβ
    2)  LM1, L2, R2                       12)  H4bα, H4bβ
    3)  LM2                               13)  RM4
    4)  LiM2                              14)  RiM4
    5)  RM2, R3, H3aα, H3aβ, H3b          15)  HM4aα, HM4aβ
    6)  RM3                               16)  HM4bα, HM4bβ
    7)  RiM3                              17)  H5α, H5β
    8)  HM3aα, HM3aβ, R4                  18)  HM5α, HM5β
    9)  HM3b                              19)  H6
   10)  HiM3b                             20)  HM6
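
The totals quoted in this summary can be cross-checked against the list of
20 classes. The sketch below is only a bookkeeping aid. It assumes the
configuration names as read from the list (several are hard to make out in
the source) and a naming convention in which "iM" marks interleaving and "M"
marks local memory.

```python
# Equivalence classes obtained when the boundary shape is ignored
# (names are an assumed reading of the list above).
CLASSES = [
    ["L1"], ["LM1", "L2", "R2"], ["LM2"], ["LiM2"],
    ["RM2", "R3", "H3aα", "H3aβ", "H3b"], ["RM3"], ["RiM3"],
    ["HM3aα", "HM3aβ", "R4"], ["HM3b"], ["HiM3b"],
    ["H4aα", "H4aβ"], ["H4bα", "H4bβ"], ["RM4"], ["RiM4"],
    ["HM4aα", "HM4aβ"], ["HM4bα", "HM4bβ"], ["H5α", "H5β"],
    ["HM5α", "HM5β"], ["H6"], ["HM6"],
]

names = [name for cls in CLASSES for name in cls]

# Assumed naming convention: "iM" = interleaved, "M" = local memory.
interleaved = [n for n in names if "iM" in n]
with_memory = [n for n in names if "M" in n and "iM" not in n]
fundamental = [n for n in names if "M" not in n]

print(len(CLASSES), len(names))
print(len(fundamental), len(with_memory), len(interleaved))
```

Under this reading the list accounts for 15 fundamental architectures, 15
with local memory and 4 interleaved ones, which matches the total of 34.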

Ignoring, in addition, the details of local memory (and, consequently, of
interleaving) results in 8 distinct configurations only, as in Table 4-1.

Choosing the optimal configuration for a given computational scheme
requires a specification of both the interconnection pattern and the
boundary shape. This can be accomplished only when specific details of the
corresponding computational scheme are taken into account (e.g., bandedness
of matrices to be multiplied). When only partial information is considered
the designer is often able to choose the interconnection pattern but not the
boundary. Thus, multiplication of two matrices can be implemented in any of
the five equivalent hardware configurations RM2 [13], R3 [14], H3aα [12],
H3aβ, H3b [11]. However, RM2 will be optimal if both matrices have no
particular structure; R3 will be optimal if only one of the matrices is
banded; and H3aα (or H3aβ) will be optimal if both matrices are banded.
It is an historical curiosity that the first systolic array for matrix
multiplication, H3b, is never optimal, because it has relative throughput
of 1/3 and is otherwise equivalent to H3a.


SECTION 5


CONCLUSIONS

A  classification  of  canonical  realizations  for  completely  regular 
modular  computing  networks  has  been  presented.  Three  levels  of  abstraction 
were considered: topology, architecture and space-time representation. The
analysis revealed 3 canonical topologies, 15 canonical architectures and 34
canonical  space-time  configurations.  It  was  shown  that  the  unique  canonical 
counterpart  of  any  given  topology,  architecture  or  space-time  configuration 
is  obtained  via  a  simple  (and  unique)  transformation  of  the  corresponding 
dependence  and  boundary  matrices.  It  was  also  shown  that  only  rectangular 
boundaries  are  required  to  implement  any  canonical  realization.  While 
ignoring  boundary  details  allows  some  flexibility  of  design,  it  also  results 
in  inefficient  implementations,  as  explained  in  Section  4.5. 

It is interesting to observe that only a small fraction of the
architectures described in this memo have actually been used in the design
of parallel algorithms. The most commonly encountered architectures are the
linear ones (L2, LM1), which are used for linear filtering (= convolution,
polynomial multiplication) and related computations. Next comes the
rectangular architecture RM2 and its equivalents (R3, H3a, H3b), which are
used in matrix products, matrix triangularizations, solutions of linear
equations, QR-factorizations for eigenvalue problems, and adaptive
multichannel least-squares algorithms. Thus, all applications to date
involve only architectures with 3 dependence vectors or fewer. Notice also
that the classical pipeline (L1) has no use as a signal processing
architecture.

The concept of completely regular MCNs involves topologies which are
mosaics or completely regular graphs (excluding the boundaries). When this
requirement is relaxed to allow regular (but not completely regular) planar
graphs, a large variety of new architectures becomes feasible, including
regular trees, self-similar graphs (corresponding to self-recursive
algorithms) and triangular mosaics. Such configurations, which occur in


APPENDIX A


EQUIVALENCE VIA LINEAR TRANSFORMATIONS


Two dependence matrices, say $D$, $\tilde{D}$, are considered equivalent when
there exists a nonsingular linear transformation T and a permutation
matrix P such that

$$\tilde{D} = P D T \tag{A.1}$$

This relation is clearly reflexive (with P = I, T = I), symmetric and
transitive, so 'equivalence' is indeed an equivalence-type relation.

Denoting the length of dependence vectors by n, and the number of
dependence vectors by p, we conclude that every dependence matrix with
p ≤ n and full (row) rank is equivalent to

$$D^{(p)} = \begin{bmatrix} I_p & 0 \end{bmatrix} \tag{A.2}$$

which will be defined as the canonical equivalent of such dependence
matrices. When p > n, and the dependence matrix has full (column) rank,
we can always find a permutation matrix P so that

$$P D = \begin{bmatrix} I \\ X \end{bmatrix} T \tag{A.3}$$


where T consists of the first n rows of the permuted matrix PD. Thus,
the canonical equivalent of dependence matrices with p > n is of the form
(A.3) and the properties of D can be studied by examining the structure of
the smaller matrix X.
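
The decomposition (A.3) can be carried out mechanically once the rows have
been ordered. The sketch below is a minimal illustration that assumes P = I
(no reordering) and a nonsingular leading n x n block. It uses the 4 x 3
dependence matrix discussed later in this appendix, and the helper names
(`matmul`, `inverse`, `canonical_split`) are hypothetical.

```python
from fractions import Fraction


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Exact inverse by Gauss-Jordan elimination (fails on singular input)."""
    n = len(m)
    a = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for c in range(n):
        p = next(r for r in range(c, n) if a[r][c] != 0)
        a[c], a[p] = a[p], a[c]
        a[c] = [x / a[c][c] for x in a[c]]
        for r in range(n):
            if r != c and a[r][c] != 0:
                a[r] = [x - a[r][c] * y for x, y in zip(a[r], a[c])]
    return [row[n:] for row in a]


def canonical_split(d):
    """Return (T, X) with D = [I; X] T, taking T as the first n rows of D."""
    n = len(d[0])
    t = [row[:] for row in d[:n]]
    x = matmul(d[n:], inverse(t))
    return t, x


D = [[1, 0, 1], [-1, 0, 1], [0, 1, 1], [0, -1, 1]]
T, X = canonical_split(D)

identity = [[int(i == j) for j in range(3)] for i in range(3)]
rebuilt = matmul(identity + X, T)  # should reproduce D
print(X, rebuilt == D)
```

With this row ordering X comes out as [1 1 -1]. Other orderings yield
permutations of the same entries.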

However, since the submatrix X in (A.3) is not unique, it is required
first to find all possible canonical equivalents to a given dependence
matrix D. This can be done by applying all possible p! permutations P
to the rows of D and then computing X via (A.3). However, not all
permutations generate distinct X matrices. In particular, if we apply a
permutation of the form

$$P = \begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix} \tag{A.4}$$

where $P_1$, $P_2$ are permutation matrices of sizes n x n and (p-n) x (p-n),
respectively, the resulting canonical equivalent is obtained by solving the
equation

$$\begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix} \begin{bmatrix} D_1 \\ D_2 \end{bmatrix} = \begin{bmatrix} I \\ X \end{bmatrix} T$$

which implies that $T = P_1 D_1$ and (assuming $D_1$ is nonsingular)

$$X = P_2 D_2 D_1^{-1} P_1^{-1}$$

Thus, X is simply some row and column permutation of the fundamental
solution $D_2 D_1^{-1}$. To obtain X-matrices that are not permutations of the
fundamental solution it is necessary to choose a permutation matrix P
which does not have the block diagonal form (A.4). There are only
$\binom{p}{n}$ ways of doing so, which is much less than p!. Moreover, not
all of these
choices  result  in  distinct  X-blocks.  Thus,  for  instance,  the  dependence 
matrix 


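$$\begin{bmatrix} 1 & 0 & 1 \\ -1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & -1 & 1 \end{bmatrix}$$

has only one canonical equivalent, viz.,

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & -1 & 1 \end{bmatrix}$$

while there are in general 24 possible permutations of its rows and 4 ways
to choose these permutations in a form that differs from (A.4).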
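
The claim that this matrix has a single canonical equivalent can be checked
by enumerating the 4 ways of selecting which three rows form the leading
block. The sketch below is illustrative and covers only the case p = n + 1
with n = 3, where X is a single row, so sorting its entries identifies it up
to permutation. It solves x T = d by Cramer's rule, and the helper names
`det3` and `x_row` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def x_row(top, d):
    """Solve x * top = d for a 3x3 block `top` using Cramer's rule on rows."""
    base = det3(top)
    if base == 0:
        raise ValueError("leading block is singular")
    # Replacing row i of `top` by d gives the numerator of x[i].
    return [Fraction(det3(top[:i] + [d] + top[i + 1:]), base) for i in range(3)]


D = [[1, 0, 1], [-1, 0, 1], [0, 1, 1], [0, -1, 1]]

solutions = {}
for chosen in combinations(range(4), 3):
    (left_out,) = set(range(4)) - set(chosen)
    solutions[chosen] = x_row([D[i] for i in chosen], D[left_out])

# For a single-row X, sorting the entries removes the permutation ambiguity.
classes = {tuple(sorted(x)) for x in solutions.values()}
print(len(solutions), classes)
```

All four selections give an X whose entries are a permutation of
(1, -1, 1), in agreement with the single canonical equivalent shown above.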


In summary, once all possible canonical equivalents of a given D have
been compiled it is relatively easy to test whether some other dependence
matrix $\tilde{D}$ is equivalent to D. One only needs to compute a single
canonical equivalent of $\tilde{D}$ and compare it to the collection of
canonical equivalents of D: a match indicates that $\tilde{D}$ is indeed
equivalent to D.


